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Abstract 

We present a method to estimate Gibbs distributions with spatio-temporal con- 
straints on spike trains statistics. We apply this method to spike trains recorded 
from ganglion cells of the salamander retina, in response to natural movies. Our 
analysis, restricted to a few neurons, performs more accurately than pairwise syn- 
chronization models (Ising) or the 1-time step Markov models (Marre et al. (2009)) 
to describe the statistics of spatio-temporal spike patterns and emphasizes the role 
of higher order spatio-temporal interactions. 

Keywords: Spike-train analysis, Higher-order correlation, Statistical Physics, Gibbs 
Distributions, Maximum Entropy 



1 Introduction 

Modern advances in neurophysiology techniques, such as two-photons imaging of cal- 
cium signals or micro-electrode arrays electro-physiology, have made it possible to 
observe simultaneously the activity of assemblies of neurons, Stevenson and Kording 
(2011). Such experimental recordings provide a great opportunity to unravel the un- 
derlying interactions of neural assemblies. The analysis of multi-cells spike-patterns 
constitutes an alternative to descriptive statistics (e.g cross-correlograms or joint peri- 
stimulus time histograms) which become hard to interpret for large groups of cells, 
Brown et al. (2004); Kass et al. (2005). Earlier multi-cells approaches, e.g., Abeles 
and Gerstein (1988), focus on synchronization patterns. Using algorithms detecting 
the most frequent instantaneous patterns in a data set, and calculating their expected 
probability, these approaches aim at testing whether those patterns were produced by 

*NeuroMathComp team (INRIA, ENS Paris, UNSA LJAD), Sophia Antipolis, France. 
INRIA, 2004 Route des Lucioles, 06902 Sophia-Antipolis, France, 
email: Juan-Carlos.Vasquez@sophia.inria.fr 

1 ' Department of Molecular Biology and Princeton Neuroscience Institute, Princeton University, USA and 
NeuroMathComp team (INRIA, ENS Paris, UNSA LJAD) 

^ Centra Interdisciplinary de Neurociencia de Valparaiso, Universidad de Valparaiso, Chile 

§ Department of Molecular Biology and Princeton Neuroscience Institute, Princeton University, USA 



1 



chance, Griin et al. (2002). This methodology relies however on a largely controversial 
assumption, namely Poisson-statistics, Pouzat and Chaffiol (2009); Schneidman et al. 
(2006). 

A second type of approach has become popular in neuroscience after works of 
Schneidman et al. (2006); Shlens et al. (2006). They used a maximum entropy ap- 
proach to model spike trains statistics by the Gibbs distribution of the Ising model. The 
parameters of this distribution are determined from the mean firing rate of each neuron 
and their pairwise synchronizations. These works have shown that for a small group 
of cells (10-40 retinal ganglion cells) the Ising model describes most (<~ 80 — 90%) 
of the statistics of the instantaneous patterns, and performs much better than a non- 
homogeneous Poisson model. 

However, several papers have pointed out the importance of temporal patterns of 
activity at the network level , Abeles et al. (1993); Lindsey et al. (1997); Villa et al. 
(1999); Segev et al. (2004a). Recently, Tang et al. (2008); Ohiorhenuan et al. (2010), 
have shown the insufficiency of the Ising model to predict the temporal statistics of 
the neural multi-cells activity. Therefore, some authors, Marre et al. (2009); Amari 
(2010); Roudi and Hertz (2010), have attempted to define time-dependent Gibbs distri- 
butions on the basis of a Markovian approach (1-step time pairwise correlations). The 
application of such extended model in Marre et al. (2009) increased the accuracy of the 
statistical characterization of data with the estimated distributions. 

In this paper we propose an extension of the maximal entropy approach to general 
spatio-temporal correlations, based on the transfer-matrix method in statistical physics, 
Georgii (1988) (section 2). We describe a numerical method to perform the estima- 
tion of the Gibbs distribution parameters from empirical data (section 3). We apply 
this method to the analysis of spike trains recorded from ganglion cells using multi- 
electrodes devices in the salamander retina (section 4). We analyse retinal spike trains 
taking into account spatial patterns of two and three neurons with triplets and quadru- 
plets terms, and temporal terms up to 4 time steps. Our analysis emphasizes the role 
of higher order spatio-temporal interactions. Section 5 contains the discussion and 
conclusions. 

2 Theoretical framework 
2.1 Spike trains and Raster Plots 

Let N be the number of neurons and denote ; = l,...,N the neuron index. Assume 
that we have discretised time in steps of size A. Without loss of generality (change 
of time units) we may set A = 1 . This provides a time discretisation labelled with an 
integer index n. We define a binary variable 0), (n) G {0, 1}, which is '1' if neuron ;' has 
emitted a spike in the n-th time interval and is zero otherwise. We use the notation (O 
to differentiate our binary variables <G {0, 1} to the notation a or S traditionally used 
for "spins" variables e { — 1,1}. The spiking pattern of the neural network at time n 
is the vector (o{n) — (o)i(n))^ =l . We denote [(o] n m the ordered sequence or spike block 
(o(m) . . . a>(n), m < n. In practice, from recordings and after applying spike sorting 
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algorithms, one obtains a sequence of spiking patterns called a raster plot. In our no- 
tations a raster plot is thus a spike block [co] where T is the total length of the spike 
time sequences, measured in A time-units. 



2.2 Observables and monomials 

We call observable a function which associates to a raster a real number. Although 
the method developed here holds for general functions, we focus on observables called 
monomials. These are functions of the form (©) = C0i l (n\ ) C0j 2 (n.2) ... 0) im (n m ) which 
is equal to 1 if and only if neuron i\ fires at time n\, neuron i m fires at time i m in the 
raster 0). Thus monomials attribute the value ' 1' to characteristic spike events. We use 
the convention that n\ < ri2 < ■ ■ ■ < n m . Then, the range of a monomial is n m — n\ + 1. 

A typical monomial is 0(a)) = 0,(0) which is equal to '1' if neuron ;' spikes at 
time in the raster CO and is '0' otherwise. This a function of a single event, of range 
1. Likewise 0(a)) = 0) ( (0) fi)/(0) is '1' if and only if neuron ; and j fire synchronously 
at time in the raster CO. This is a function of pairwise event, of range 1 too. As a last 
example, 0(a)) = COi (0) 0)2(1) 0)3(2) 0)4(5) is a function of a quadruplet of spikes, of 
range 6. 



2.3 Hidden probability 

Collective neuron dynamics, submitted to noise, produces spike trains with random- 
ness, although some statistical regularity can be observed. The spike trains statistics 
are assumed to be characterized by an hidden probability jx h giving the probability of 
spatio-temporal spike patterns. A current goal in experimental analysis of spike trains 
is to approximate [i h from data. A model is a probability distribution jU which ap- 
proaches [i h . We give a precise meaning of approaching a probability by another one 
below. Typically, jj, must predict the probability of spike blocks occurrence with a good 
accuracy. 

Given a model fi we note fi [0] the average of an observable with respect to jj,. For 
example the average value of (o)) = 0), (n) is given by jj. [a)/(n)] = £»,(«) 0), (n) jj, [»;(«)] 
where the sum holds on all possible values of 0), («) (0 or 1). Thus, finally jj, [0] = 
jU [o), (n) = 1] is nothing but the probability of firing of neuron i at time n, predicted by 
the model jj,. Likewise, the average value of co iy (n) 0), 2 (n) is the predicted probability 
that neuron i\ and z'2 fire at the same time n: this is a measure of pairwise synchroniza- 
tion. More generally, for the monomial = C0i l («i ) 0); 2 (/12) • ■ • 0), m (n m ), jl [0] is the 
predicted probability of occurrence of the event "neuron i\ fires at time n\, neuron 
i m fires at time i m ". 

We assume here, as in most papers dealing with spike train statistics, that hidden 
statistics are stationary so that the average value of functions is time-translation invari- 
ant. As a consequence we consider time-translation invariant models (e.g., jj, [o), («) = 1] 
is independent on n). 
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2.4 Time-average 

Given an experimental raster co of duration T, and an observable (j> we note 7C^ [0] 

the time-average of 0. For example, when <j)((o) = *»,(«), n m [<j>] — j L„Jo G) i( n ) 
provides an estimation of the firing rate of neuron i (it is independent of time from the 
stationarity assumption). If is a monomial (0^ {n\) ... (0 im (n m ), 1 < «i < «2 < n m < T 
then n®^ [<j>] = T } n L^Jo" 1 a h («i +") • • • &>; m (n m + n), and so on. We use the cumber- 

some notation n a [<j>] to remind that such time averages are random variables. They 
fluctuate from one raster to another and the amplitude of those fluctuations depend on 
T. We assume ergodicity which is a common hypothesis in this field. Then, for any 
observable <p, [<j>] — > pi h [<p] as T — > where the limit is independent of the raster 

CO. 



2.5 Gibbs distribution 

Fix a set of observables {0/}f =1 whose time average [0/] has been measured and 
is equal to C;. To match those empirical statistics, the model fx has to satisfy: 

At M = tz { J ] [0,] =Q, / = 1,...,L. (1) 

This is a minimal, but insufficient requirement, since one can construct infinitely many 
probability distributions satisfying the constraints (1). 

However, with the additional requirement that the model has to "Maximize the sta- 
tistical entropy under the constraints (1)", a unique model is selected. This is the max- 
imal entropy principle, Jaynes (1957) that amounts to solving, i.e. find the maximum 
of: a variational principle: 

P(Y)= sup (A[v] + vM). (2) 

The term \j/ defined by: 

V=I></> Z , (3) 

i=i 

is called a potential. The Xi are free parameters (Lagrange multipliers). \jf is thus a 
linear combination of the observables defining the constraints (1). The supremum in 
(2) is taken over m^ m \ the set of time-translation invariant (stationary) probabilities on 
the set of rasters for N neurons, h is the entropy rate, see Ruelle (1969, 1978); Keller 
(1998); Chazottes and Keller (2009) for the general definition. 
A probability fx which realizes the supremum (2), i.e., 

P(Y)=h[n}+n[Y}, (4) 

is called a Gibbs distribution. This name has its roots in statistical physics and we 
discuss this connection in the next paragraph. The term P(\j/), called the topological 
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pressure in this context is the formal analog of a thermodynamic potential (free energy 
density). It is a generating function for the cumulants of y/. In particular; 

Let us summarize what we have just obtained. To a set of experimental constraints, 
associated with a set of observables {<j>i}f =l , one associates a probability distribution jj,, 
called a Gibbs distribution, parametrized by the potential (3), a linear combination of 
0/'s. Now, comparing equations (1) and (5), one sees that the free parameters A/ can be 
adjusted so that the Gibbs distribution jj. matches the constraints (1). We will explain 
how this computation can be done in section 3. It turns out that P(y/) is a convex 
function. Therefore, there is a unique set of Xi so that jx matches the constraints (1). 
Hence the maximal entropy principle provides a unique statistical model matching the 
experimental constraints (1). 

Note that fx depends on y/, thus (i) on the choice of observables; (ii) on the param- 
eters X[. However, we drop this dependence in the notation to ease legibility. 



2.6 A remark. Links with previous approaches 

The maximal entropy principle is commonly used in statistical physics and has been 
applied by several authors for spike trains analysis, Schneidman et al. (2006); Tkacik 
et al. (2009, 2010); Schaub and Schultz (2010); Ganmor et al. (2011a,b). Here we 
would like to insist on the main difference between our approach and the one of these 
authors. 

In those references, constraints correspond to simultaneous spike events (mono- 
mials of the form a>i l (n) ... a>i m (n)) corresponding to spatial patterns. On the opposite, 
our observables 0/ correspond to spatio-temporal events so that y/ depends on the raster 
plot over a (finite) time horizon R, i.e. y/~(ft)) = ^([o)]^ -1 ). We speak of "range-/? po- 
tentials". Thus, our method imposes constraints on general spatio-temporal events 
instead of focusing on spatial constraints. 

The difference is not anecdotic. Imposing spatio-temporal constraints amounts to 
considering a statistical model in which the probability of a spiking pattern depends on 
the past history: the system has a memory and its actual state depends on its past via 
a set of causal spatio temporal relations. Typically, this is described by a Markovian 
process, (although non Markovian dynamics also occur in neural networks models, 
Kravchuk and Vidybida (2010); Cessac (2011a,b)). The Markovian case has been con- 
sidered by several authors in the field of spike statistics analysis, but with one time step 
memory only, and under assumptions such as detailed balance, Marre et al. (2009) or 
conditional independence between neurons, see eq. (1) in Roudi and Hertz (201 1). 

The method introduced here does not use these assumptions and allows us to con- 
sider, on a theoretical ground, general spatio-temporal constraints. It is based on a 
mathematical object called, in statistical physics, "transfer matrix" Georgii (1988) and 
in ergodic theory "Ruelle-Perron-Frobenius operator", Bowen (1975); Ruelle (1978); 
Meyer (1980). Although this method extends to non-Markovian dynamics, Cessac 
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(201 la,b), in the present paper, we restrict to finite memory. In this restricted case, this 
method has its roots in matrix representation of Markov chains and Perron-Frobenius 
theorem, Gantmacher (1998); Seneta (2006). So this method is well known but, to our 
knowledge, it is the first time that it is applied to the analysis of spike trains. 

Since we focus on Markovian dynamics here, Gibbs distribution could also be in- 
troduced in this setting, see Cessac and Palacios (2011) for a didactic presentation in 
the realm of spike train analysis. However, the advantage of the presentation adopted 
here is its compactness compatible with the limited allowed space of the paper. 



3 Estimation of Gibbs Distributions 

Let us now show how P(y) and fx, the main objects of our approach, can be computed. 

3.1 The transition matrix 

The range R of the potential \j/ is the maximum of the ranges of monomials defining 
If R = 1 the potential depends only on simultaneous events and corresponds to 
considering a memory-less process as a model. On the opposite, if R > 1 the potential 
accounts for spatio-temporal events and corresponds to taking into account memory 
and time-causality in the model. 

We assume here that R > 1 and come back to the case R = 1 below. The starting 
point is to consider that a block [cq]" +r ~ 1 is a transition from a block [cq]1 +r ~ 2 , of 
range R — 1, to a block °f ran g e R—l too. Therefore, the two blocks overlap 

(the sequence [o]"^ _1 is common to both blocks). It is useful to choose a symbolic 
representation of spike blocks of range R — 1. Indeed, there are M = 2 /v ( R ~ 1 ) such 
possible spike blocks, requiring, to be represented, N(R— 1) symbols ('0"s and T's). 
Instead, we associate to each block [o]JJ +R_1 an integer: 

w n= I l2 i+A "- £ » i (n + r). (6) 

r=0 (=0 

We write w„ <~ [co]^ +s ~ 1 . Now, for integer m, n such that m <n,n — m>R,a spike se- 
quence [co]^ = (o(m) 0)(m+ 1) ... (o(m+R— 1) . . . 0)(n) can be encoded as a sequence 
of integers w m ,w m+ i . . . w n -R + \ . Clearly, this representation introduces a redundancy 
since successive blocks w n ,w n+ \ have a strong overlap. But what we gain is a con- 
venient matrix representation of the spike trains process. Note that each symbol w„ 

belongs to However, when encoding spike trains by a sequence of 

such symbols, we cannot have any possible succession of symbols w n ,w n+ \. Indeed, 
the corresponding blocks must overlap (they must have the sequence [<o]"+f _1 in com- 
mon). We say that the succession w ni w„ + \ is legal if the corresponding blocks overlap. 

For two integers w',w € we define the transition matrix 

with entries: 

eVw'w if w ',w is legal 



^AW) = \ n otherwise. ■ (7) 
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where \j/ w i w stands for y/([ft)] ). Indeed, for a legal transition w',w, fixing w',w is 
equivalent to fixing the block [ffl]^ - as well. 

Remark. J£?(y) is a huge matrix (with 2 A, < ff ~ 1 > x 2 N< - R -^ symbols). However, 

1. This is a sparse matrix. Indeed, on a each row, there are at most 2 N non-zero 
entries. 

2. If, instead of considering all possible symbols, one restricts to symbols (blocks) 
effectively appearing in an experimental raster, the dimension is considerably 
reduced. 

3.2 The Perron-Frobenius theorem 

Since, Jzf (y/) is a positive matrix it obeys the Perron-Frobenius theorem, Gantmacher 
(1998); Seneta (2006). Instead of stating it in its full generality, we give it under the 
assumption that the «5f (y/) is primitive, i.e. 3n > 0, s.t. Vw,w' Jz?^, w {y) > 0. This as- 
sumption holds for Integrate and Fire models with noise and is likely to hold for more 
general neural networks models where noise renders dynamics ergodic and mixing. 
Then, the Perron-Frobenius theorem states that Jzf (\jf) has a unique real positive max- 
imal eigenvalue s(\j/) associated with a right eigenvector \b) and a left eigenvector (b\ 
such that Jz? (y)\b) = s(\j/)\b), and (b\Jf (y/) = s(\j/)(b\. Those vectors can be chosen 
such that the scalar product (b \ b) = 1. The remaining part of the spectrum is located 
in a disk in the complex plane, of radius strictly lower than s(\j/). 

It can be shown, Bowen (1975); Ruelle (1978); Keller (1998), that the topological 
pressure P(y/) is: 

P(W) = log S (Y). (8) 
Moreover, the Gibbs distribution fi is: 

H = \b)(b\, (9) 

i.e. the probability of a spike block ~ w of range R — 1 is 

jii(w) = \b w )(b w \, 

where \b w ) is the w-th component of \b). So we have a simple way to compute the 
topological pressure and the Gibbs distribution by building the transition matrix of the 
model. Note that we don't have to compute a partition function. 

3.3 The case R = 1 

Our method can be applied to this case as well, although other methods for range 1 
potentials have been applied in the literature and are more efficient Schneidman et al. 
(2006); Tkacik et al. (2009). In our setting, a range-1 potential \j/ depends on w' only 
and the matrix ^f(y/) has constant non zero coefficients e^' on each raw. Then, it 
is straightforward to check that this matrix has N —I eigenvalues equal to 0, while 
the largest one is Z = Lw' e "'' > % tne partition function of a lattice model with potential 
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y/. So = logZ. Likewise the left eigenvector is (b\ = (1,. .. , 1), while the right 
eigenvector has entries = e^ . Thus, the corresponding probability is jU w = ge v ' > ", 
the Gibbs distribution on a lattice, with potential y/. 

3.4 Comparing several Gibbs statistical models 

The choice of a potential (3), i.e. the choice of a set of observables, fixes a statistical 
model. Since, there are many choices of potentials one needs to propose a criterion to 
compare them. 

The Kullback-Leibler (KL) divergence c1kl(IJ-,v) provides some notion of asym- 
metric "distance" between two probabilities, jj. and v. The computation of (Ikl(H, v ) 
is numerically delicate but, in the present context, the following holds. For v a time- 
translation invariant probability and jj. a Gibbs measure with a potential y/, one has, 
Keller (1998); Chazottes and Keller (2009): 

d K L(v,n)=P{\i/)-v[\i/]-h(v). 

This allows to estimate the divergence of our model fi to the hidden probability 
fl h , providing the exact spike train statistics. The smaller the quantity <i/sx(M ft >M) = 
P(y/) - jit' 1 [y/] -h(fi h ), the better is the model. Obviously, since fi h is unknown this 
criterion looks useless. However, 

1 . As stated in the section "Time average", \i h [ y/] is well approximated by 71®^ [ v] = 

Yd=\ h^a^ [0/]> where, by definition [(pi] = Q. Therefore, \i h [ y/] ~ Yk=\ h Q> 
where ~ means that the right-hand side approaches the left-hand side as T — >• °°. 
More precisely, the distance between the two quantities converges to as ^ 
Bowen (1975); Ruelle (1978); Georgii (1988). 

2. The entropy h(jj, h ) is unknown and its estimation by numerical algorithms be- 
comes more and more cumbersome and unreliable as the number of neuron 
increases, Grassberger (1989); Schiirmann and Grassberger (1996); Gao et al. 
(2008). However, when comparing two statistical models pL\,pL2 with potentials 

y/i , y/2, for the analyse the same data, h{n) is a constant since it only depends on 

It) 

data. Thus, comparing these two models amounts to comparing P [y/i] — Tim 
and P [y/2] - Km*' Wi]- Introducing 

L 

h[Y]=P [w] - 4P M = P [Y] ~ L hC u (10) 

1=1 

(where the A/ depend on the potential via (5)), the comparison of two statistical 
models y/i, y/2, i.e. determining if model y/2 is significantly "better" that model 
y/i , reduces to the condition: 

/z[y/ 2 ]«MVi]- (11) 

The advantage of (10) (sometimes called "cross-entropy") compared to the KL di- 
vergence is that we have removed the entropy, which is subject to huge fluctuations 
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when determined numerically from a finite raster with many neurons. Thus, (10) is 
less sensitive to statistical bias. What we loose, is an absolute criterion for model com- 
parison. We can just say that a model is a better than another one but we cannot say 
how close we are from the hidden probability. For this latter purpose, we estimate the 
entropy explicitly using the method proposed by Strong et al. (1998). An example is 
given below, for a small number of neurons. 

3.5 Numerical implementation 

Let us now briefly discuss how to numerically estimate the Gibbs distribution. For 
details see Vasquez et al. (2010). The code is available at http : //enas . gf orge . 
inria.fr/. The algorithmic procedure proposed decomposes in three steps. 

• Choosing a statistical model, i.e. choosing a guess potential y/ = Yli=i A/0/ or 
equivalently, a set of observables. 

• Computing the time averages Q. To compute the time-average we use a data 
structure of tree type, with depth R and degree 2 N , see e.g., Grassberger (1989) 
for a formal introduction. The nodes count the number of occurrences of blocks 
encountered in the raster. Thus, we do not store explicitly blocks of occurrence 
zero. Moreover, when comparing the distributions for distinct ranges R we can 
count in one pass, and in a unique data structure, block of different ranges. 

• Performing the parametric estimation. The parametric estimation aims at 
finding the A/ minimizing (10), by calculating the topological pressure. Note 
that, from (10), finding a point where h is extremal is equivalent to solving (5). 
Additionally, P(y/) is convex, thus h is convex too as a linear combination of 
convex functions. Thus, there is a unique minimum corresponding to the solu- 
tion of (5). 

We start with a random guess for the A/, and then iterate the following steps: 

1. Build the matrix Jzf (y/) from the values of A/ and equation (7). 

2. Compute the eigenvectors (b\ , \b) of Jzf ( y/) and the highest eigenvalue s(\j/) 
using a standard power- method series. 

3. From this eigenvalue, compute the topological pressure. This gives h. 

4. From the left and right eigenvectors, we have the Gibbs distribution jj, cor- 
responding to this set of parameters A/. One then computes the average 
value of 0/ under jj,, jj, [0/] Now, from (5) the derivative of P(y) with re- 
spect to Xi is exactly fx [0/]. This provides an exact expression for the gra- 
dient of 

5. To update the A/ toward the minimum of h, we have tried several meth- 
ods. The most efficient are based on gradient algorithms where the gra- 
dient of P(\)f) is exactly known from the previous step. The most ef- 
ficient method seems to be the Fletcher-Reeves conjugate gradient algo- 
rithm from the GSL http : / /www . gnu . org/ sof tware/gsl, while 
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other methods such as the Polak-Ribiere conjugate gradient algorithm, and 
the Broyden-Fletcher-Goldfarb-Shannon quasi-Newton method appeared 
to be less efficient. We have also used the GSL implementation of the 
simplex algorithm of Nelder and Mead which does not require the ex- 
plicit computation of a gradient. This alternative is usually less efficient 
than the previous methods. All these methods are available in our library 
http : / / enas . gf orge . inria. f r /. 

6. Repeat the previous steps until h attains its minimum. 

4 Analysis of Biological data 

4.1 Methods 

Retinae from the larval tiger salamander (Ambystoma tigrinum) were isolated from 
the eye, placed over a multi-electrode array and perfused with oxygenated Ringer's 
medium at room temperature (22 °C). Extracellular voltages were recorded by a micro- 
electrode array and streamed to disk for offline analysis. Spike sorting was performed 
as described earlier in (Segev et al. (2004b)) to extract 40 cells. The stimulus was a 
natural movie clip showing a woodland scene. The 20-30 s movie segment was re- 
peated many times. All visual stimuli were displayed on an NEC FP1370 monitor and 
projected onto the retina using standard optics. The mean light level was 5 lux, corre- 
sponding to photopic vision. The total recording time was around 3200s with sampling 
frequency of 10000Hz. 

4.2 Analysis of spike train-statistics 

We have used the recorded spike trains of retinal ganglion cells to fit models with 
different sets of constraints. 

The Linear model has a potential y/(o)) = £,- = ijv A; O),(0) (thus constraints are only 
imposed on firing rates). The corresponding Gibbs probability is a Bernoulli distribu- 
tion where spikes are independent. For a fixed range R, we call All-R a potential con- 
taining all possible and non redundant monomials of range R. For example, a monomial 
containing products of the form (of{n), k> 1 is redundant since 0)f(n) = ©, («). The 
next equation shows for clarity the potentials Linear, All-1, All-2 for a pair of neurons. 
Note that for a pair of neurons All-1 coincides with the usual Ising statistical model but 
for a triplet of neurons it contains an extra triplet synchronization term. 

linear : y/(co) = X\ Ot)i(0) + A2 ©2(0). 
All-1 : y/(o)) = AiO)i(0) + A 2 0)2(0) + A 3 0)1(0)0)2(0). 
All-2 : ip(fi>) = Ai 0)1 (0)+A 2 0)2(0)+ A 3 0)1 (0)0)2(0) 

+A 4 0)1 (0) 0)1 (1)+ A 5 0)2(0) ©2(1) (12) 
+ A 6 0)1 (0) 0)2 ( 1 ) + A 7 0)1 ( 1 ) 0)2 (0) 

+ A 8 0)1 ( 1 ) 0)2 (0) 0)2 ( 1 ) + Ag 0)i (0) 0)1 ( 1 ) 0)2 (0) 
+ A10 0)1 (0)0)2(0) 0)1 (1)0)2(1). 
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For the model estimation we bin the spike trains using bin sizes of 10 ms (we ob- 
tain similar results with larger bin sizes). We estimate the model parameters and the 
Kullback-Leibler divergence between the model distribution and the empirical distri- 
bution. To have an error bar on the latter, we divide the raster in 15 equal subsets, and 
we randomly pick 13 subsets, on which we estimate the d^L- This process is repeated 
many times (more than 100) and we estimate the error bar from the distribution of the 
djcL values obtained. 
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Fi gure 1; The KL divergence between empirical distribution computed from observed data and the 
distribution of the estimated model (different models are shown for several pairs, and error bars are included), 
"nats" means "natural units" (the KL divergence is divided by log2). (Left) This figure depicts d^L for the 
Linear model and All-R from R = 1 to R = 3. Note that for a cell pair, All- 1 model corresponds to the 
pairwise Ising model. (Right) The iIkl divergence for models of range R = 3 are examined versus the role of 
pairs, then triplets and finally the full set of terms that constitute the All-3 model. The KL divergence of the 
linear model is included for comparison. 
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Figure 2: 5-fold cross-validation for over-fitting of the cross-entropy (written h, on the figure). For a 
single pair, the y-axis shows the cross-entropy estimated on the same models than in the previous figure, for 
both training and testing sets. 
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We first focus on the statistics of spiking patterns using models with range from 1 up 
to 3. The KL divergence between the empirical Distribution and the model is depicted 
in figure 1, for 3 examples of pairs. Figure 1 (left plot) shows the effect of including all 
interaction terms within the chosen number of time bins. Increasing in the hierachy of 
models, from Linear to All-3 shows significant improvements. Naturally, the number 
of possible interaction terms explodes combinatorially with the range of the models. 
Therefore, we estimate the impact of adding higher order interaction terms in a range 
3 model. Figure 1 (right plot) shows that, although the largest improvement happens 
when adding the pairwise terms, adding triplets interactions also allows a significant 
decrease of cIkl- Beyond third order, we did not see any improvement. 

One could think that the improvement shown when adding monomials to the model 
is due to overfitting. To discard this hypothesis, we divide the raster in 5 subsets, fit 
the model with 4 of them, and compute the cross-entropy between the model and the 
fifth subset. We then change the tested subset and repeat the calculation to obtain 
error bars. Figure 2 shows that there is no difference in the mean value of the cross- 
entropy between the training and testing sets, and the error bar is still smaller than 
the difference between the models. So the improvement we see when adding terms is 
significant. Note that using the cross-entropy h, instead of Kullback-Leibler divergence 
cIkl, eliminates the effects of using biased entropy estimators, as pointed out in section 
3.4. 

The results depicted above for the three pairs can be generalized to any pair ran- 
domly selected among the cells available. We estimate the improvement gained from 
the model of range 1, similar to Schneidman et al. (2006), or range 2, Marre et al. 
(2009), to the model of range 3, quantified by the difference of cIkl between the data 
and the range 1 or 2 model, and between the data and the range 3 model. We randomly 
pick 100 pairs of cells and estimate these cIkl differences. Fig. 3 shows the histogram 
of differences, 8cIkl , between cIkl for an All-1 model and an All-3 model (left), as 
well as for an All-1 model and an All-2 model (right). Note that it is equivalent to 
consider 8h or 8(Ikl since the term h(jj.) cancels when taking the difference. The av- 
erage value of 8cIkl for an All-1 model and an All-3 model (Fig. 3 left) is 0.012 with 
a standard deviation 0.01 while the average value of 8/Jkl f° r an All-1 model and an 
All-2 model (Fig. 3 right) is 0.0056 with a standard deviation 0.004. In the former 
case, the difference ScIkl can be more than 0.04. So our range 3 model improves the 
statistical description of the data compared to previously used ones, confirming the re- 
sult observed on individual pairs. The amount of improvement is highly heterogeneous 
depending on the pair chosen. 

Can these models predict statistics on which they were not fitted? To answer that 
question, we estimate the rate of spiking pattern of two neurons and 1 to 4 time bins. 
Figure 4 shows the empirically-observed pattern rate against the pattern rate predicted 
by each model All-1, All-2 and All-3. Each point corresponds to a spike block. It 
appears that the model All-3 provides a much better description of the statistics than 
All-1 and All-2. The result also holds for triplets of neurons (data not shown) and still 
holds for a bin size of 20ms. In addition, to explore the effects of including higher 
order spatio-temporal interactions given a range, we show in Figure 5 the same type 
of plot, for a set of N=4 neurons with models of range R=3 with pairs and triplets. So 
triplet terms do enhance statistical description of spatio-temporal patterns. 
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Figure 3: Histogram of differences, 8dxL > between rf^t, for an All-1 model and an All-3 model (left), as 
well as for an All-1 model and an All-2 model. The histogram has been computed for 100 pairs. 




Observed block probability AT(ms) 

Figure 4: The estimated block probability versus the observed block probability for all possible blocks 
from range 1 to 4 (coded by colors), one pair of neurons (pair 3) using All-R models R = 1,2,3 with data 
binned at 10 ms. We include the equality line y = x and the confidence bounds (black lines) for each model, 
corresponding to fl(w) = Tl^' 1 [w] ± 3cv, d w being the standard deviation for each estimated probability 
given the total sample length T ~ 3 ■ 10 5 . The cross-correlation function (CCF) of this pair is also depicted 
(bottom right). 



We also assess the performance of the models in predicting the total number of 
spikes during a given window of time. Figure 6 shows this performance for several 
models, fitted with two different bin sizes. The number of spikes, measured or pre- 
dicted, is counted over 80 and 120 ms windows. The All-2 model already predicts well 
the statistics, and the All-3 model improves marginally the performance. These two 
models are visually almost indistinguishable when a small number of bins is used (i.e, 
bin size of 20ms corresponding to bottom row uses 4 and 6 bins, while the bin size of 
10ms depicted on the upper row uses 8 and 12 bins). 

We examine the coefficients of the parametric estimation by plotting the distri- 
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Figure 6: Distribution of the number of spikes fired by a pair of cells in 80m* (left column) and 120ms 
(right column), compared with predictions by several models: Linear (independent), All-1, All-2,All-3 and 
bin sizes 10ms (upper row) and 20ms (bottom row). 



bution of the monomial coefficients values after estimation of a All-3 model over 50 
different pairs for single spikes, pairs, and triplets. They are depicted in figure 7. Note 
that none of them is centred at zero, in particular triplet terms are not negligible, sug- 
gesting that higher order spatio-temporal interactions do matter. The same conclusion 
holds for groups of three neurons (data not shown). Additionally, we remark that tak- 
ing larger bin sizes (20,50ms) reduces the relative value of coefficients but distribution 
is still not centred. 
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Figure 7: Distribution of the monomial coefficients values has been computed after estimation of a All-3 
model over 50 different pairs, for several choices of bin size. The histogram has been constructed separately 
for single spikes, pairs and triplets. Note that in our framework we consider only non redundant monomials 
so there is a single coefficient for each monomial. For instance, for the monomial ffl,(0) fflj(O) there is only 
one coefficient ij without a symmetrical ji present 



5 Discussion and Conclusion 

In this paper, we have developed a Gibbs distribution analysis for general spatio- 
temporal spike patterns. Our method allows one to handle Markovian models with 
memory up to the limits imposed by the finite size of the data. Our analysis on retina 
data suggests that higher order interaction terms, as well as interaction between non 
consecutive time bins, are necessary to model the statistics of the spatio-temporal spik- 
ing patterns, at least for small populations of neurons. 

An important issue is to determine whether these higher order terms are still es- 
sential when looking at much larger groups of neurons: either the complexity of the 
models will grow with the number of neurons, or adding neurons will have a similar 
effect as uncovering hidden variables, and might then weaken these interactions. The 
extension to large networks of our method is, thus, an important future step to progress 
in our understanding of the spatio-temporal statistics of spike trains. However, the 
identification of the relevant neural subsets in a large number of neurons remains an 
open problem. Moreover, to explore models with larger ranges, one needs to control 
the confidence level due to finite size effects given the available amount of data, which 
can be addressed through Neyman-Pearson results for Markov chains of finite order 
Nagaev (2002). Additionally, the spectrum of the Perron-Frobenius matrix provides 
information about the correlation decay time, which can be used to determine the opti- 
mal range of the model. Both issues are to be developed in a forthcoming paper. 
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